What is it?

A way to subdivide a giant neural network into sub-networks ("experts"), each trained to specialise in one aspect of the problem, which are then sparsely combined at inference time for each task.

This means that at inference time, instead of using all of the model's parameters for every input, an MoE model selects a different subset of parameters for each input. It also means an MoE model can have an outrageous total number of parameters, of which only a small fraction are activated for any given input.
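As a rough back-of-the-envelope illustration (the expert count and per-expert size below are made-up numbers, not taken from any particular model), here is how the total parameter count and the per-input active parameter count can diverge:

```python
# Hypothetical sizes chosen only to illustrate sparse activation.
num_experts = 64            # total expert sub-networks in one MoE layer
params_per_expert = 8_000_000
experts_per_token = 2       # top-k experts actually used for each input

total_params = num_experts * params_per_expert
active_params = experts_per_token * params_per_expert

print(f"total expert parameters: {total_params:,}")                     # 512,000,000
print(f"active per input:        {active_params:,}")                    # 16,000,000
print(f"fraction used per input: {active_params / total_params:.1%}")   # 3.1%
```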

This idea was introduced in 1991, and while it made theoretical sense, it remained impractical for several reasons:

  1. GPUs are faster at arithmetic than at branching
  2. Network bandwidth is a bottleneck between GPU clusters
  3. Large data sets are needed to train models with millions to billions of parameters
  4. Models with this many parameters are prone to training instability

Architectures

  • One architecture design is the "sparsely-gated mixture-of-experts layer". It uses a sparse gating function to select a small number of experts (e.g., the top two sub-networks) to perform the computation for each input (a minimal sketch appears after this list)
  • Google developers introduced GShard, a module for scaling the Transformer model with a sparsely-gated MoE architecture, back in 2020
  • Then, in 2022, they introduced the Switch Transformer, which simplifies the MoE routing algorithm by routing each token to just one expert
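
To make the gating idea concrete, here is a minimal PyTorch sketch of a sparsely-gated top-2 MoE layer. The module name, expert sizes, and the naive loop over experts are illustrative assumptions chosen for readability, not the reference implementation from the GShard or Switch Transformer papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparselyGatedMoE(nn.Module):
    """Illustrative top-k MoE layer: a gating network picks k experts per token."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # gating / routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                             # (tokens, experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)   # keep only the k best experts
        weights = F.softmax(top_vals, dim=-1)             # renormalise over the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = top_idx[:, slot]                        # expert chosen for this slot
            for e, expert in enumerate(self.experts):
                mask = idx == e                           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 4 tokens of width 16 through 8 experts, 2 active per token.
moe = SparselyGatedMoE(d_model=16, d_hidden=32, num_experts=8, k=2)
tokens = torch.randn(4, 16)
print(moe(tokens).shape)  # torch.Size([4, 16])
```

The Switch Transformer's simplification corresponds roughly to setting k = 1 here, so each token is routed to a single expert instead of two.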